Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
significant margin.
Comment: Under submission as a conference paper
Perfect match: Improved cross-modal embeddings for audio-visual synchronisation
This paper proposes a new strategy for learning powerful cross-modal
embeddings for audio-to-video synchronization. Here, we set up the problem as
one of cross-modal retrieval, where the objective is to find the most relevant
audio segment given a short video clip. The method builds on the recent
advances in learning representations from cross-modal self-supervision.
The main contributions of this paper are as follows: (1) we propose a new
learning strategy where the embeddings are learnt via a multi-way matching
problem, as opposed to a binary classification (matching or non-matching)
problem as proposed by recent papers; (2) we demonstrate that performance of
this method far exceeds the existing baselines on the synchronization task; (3)
we use the learnt embeddings for visual speech recognition in self-supervision,
and show that the performance matches the representations learnt end-to-end in
a fully-supervised manner.
Comment: Preprint. Work in progress
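
The contrast in contribution (1) between multi-way matching and binary matching can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, and using the negative squared Euclidean distance as the matching score is an assumption, since the abstract does not specify the similarity measure. One video embedding is scored against N candidate audio embeddings, and a softmax cross-entropy is taken over all candidates at once instead of making a separate match/non-match decision for each pair.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def multiway_matching_loss(video_emb, audio_candidates, target_index):
    """Softmax cross-entropy over N audio candidates for one video clip.

    Assumption: the logit of each candidate is the negative squared
    distance to the video embedding (closer means higher score).
    """
    logits = [-squared_distance(video_emb, cand) for cand in audio_candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[target_index]


def retrieve(video_emb, audio_candidates):
    """Index of the audio candidate closest to the video embedding."""
    dists = [squared_distance(video_emb, cand) for cand in audio_candidates]
    return min(range(len(dists)), key=dists.__getitem__)


video = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
best = retrieve(video, candidates)
loss_if_correct = multiway_matching_loss(video, candidates, target_index=1)
loss_if_wrong = multiway_matching_loss(video, candidates, target_index=0)
```

In this toy example the loss is small when the labelled match is the nearest candidate and large otherwise, so every non-matching candidate contributes to the same training signal.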
FaceFilter: Audio-visual speech separation using still images
The objective of this paper is to separate a target speaker's speech from a
mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled
speaker information as an auxiliary conditional feature, we use a single face
image of the target speaker. In this task, the conditional feature is obtained
from facial appearance in a cross-modal biometric task, where audio and visual
identity representations are shared in a latent space. Identities learnt from
facial images force the network to isolate the matched speaker and extract that
speaker's voice from the mixed speech. This solves the permutation problem caused
by swapped channel outputs, which occurs frequently in speech separation tasks. The proposed
method is far more practical than video-based speech separation since user
profile images are readily available on many platforms. Also, unlike
speaker-aware separation methods, it is applicable to the separation of unseen
speakers who have never been enrolled before. We show strong qualitative and
quantitative results on challenging real-world examples.
Comment: Under submission as a conference paper. Video examples:
https://youtu.be/ku9xoLh62
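
The permutation problem mentioned in the abstract can be illustrated with a small sketch. This is not the FaceFilter method, which sidesteps the problem by conditioning on a face image; it shows only the standard permutation-invariant loss that unconditioned separators typically need. The function names and the mean squared error criterion are assumptions made for illustration.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fixed_order_loss(outputs, references):
    """Loss that assumes output channel i always matches reference i."""
    return sum(mse(o, r) for o, r in zip(outputs, references))


def permutation_invariant_loss(outputs, references):
    """Smallest total loss over every assignment of outputs to references."""
    n = len(references)
    return min(
        sum(mse(outputs[i], references[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )


# The separator recovered both sources perfectly but in swapped channels.
outputs = [[1.0, 1.0], [0.0, 0.0]]
references = [[0.0, 0.0], [1.0, 1.0]]
naive = fixed_order_loss(outputs, references)
invariant = permutation_invariant_loss(outputs, references)
```

The fixed-order loss penalises a perfect but swapped separation, while the permutation-invariant loss does not. A system conditioned on the target speaker's identity has a single, well-defined output and avoids the ambiguity altogether.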
MIRNet: Learning multiple identities representations in overlapped speech
Many approaches can derive information about a single speaker's identity from
the speech by learning to recognize consistent characteristics of acoustic
parameters. However, it is challenging to determine identity information when
there are multiple concurrent speakers in a given signal. In this paper, we
propose a novel deep speaker representation strategy that can reliably extract
multiple speaker identities from overlapped speech. We design a network that
can extract a high-level embedding that contains information about each
speaker's identity from a given mixture. Unlike conventional approaches that
need reference acoustic features for training, our proposed algorithm only
requires the speaker identity labels of the overlapped speech segments. We
demonstrate the effectiveness and usefulness of our algorithm in a speaker
verification task and a speech separation system conditioned on the target
speaker embeddings obtained through the proposed method.
Comment: Accepted in Interspeech 202
HD-DEMUCS: General Speech Restoration with Heterogeneous Decoders
This paper introduces an end-to-end neural speech restoration model,
HD-DEMUCS, demonstrating efficacy across multiple distortion environments.
Unlike conventional approaches that employ cascading frameworks to remove
undesirable noise first and then restore missing signal components, our model
performs these tasks in parallel using two heterogeneous decoder networks.
Based on the U-Net style encoder-decoder framework, we attach an additional
decoder so that each decoder network performs noise suppression or restoration
separately. We carefully design each decoder architecture to operate
appropriately depending on its objectives. Additionally, we improve performance
by leveraging a learnable weighting factor, aggregating the two decoder output
waveforms. Experimental results with objective metrics across various
environments clearly demonstrate the effectiveness of our approach over a
single-decoder or multi-stage systems for the general speech restoration task.
Comment: Accepted by INTERSPEECH 202
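
The idea of aggregating two decoder output waveforms with a learnable weighting factor can be sketched as follows. The abstract does not give the exact formulation, so this is one plausible form under stated assumptions: a single scalar parameter, squashed by a sigmoid into a weight between 0 and 1, forms a convex combination of the two waveforms. The names `aggregate` and `weight_logit` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate(suppressed, restored, weight_logit):
    """Blend two decoder waveforms sample by sample.

    Assumption: weight_logit is a scalar parameter that would be updated
    by gradient descent during training; sigmoid keeps the weight in (0, 1).
    """
    alpha = sigmoid(weight_logit)
    return [alpha * s + (1.0 - alpha) * r for s, r in zip(suppressed, restored)]


suppression_out = [1.0, 0.0]   # toy output of the noise-suppression decoder
restoration_out = [0.0, 1.0]   # toy output of the restoration decoder
balanced = aggregate(suppression_out, restoration_out, weight_logit=0.0)
mostly_suppressed = aggregate(suppression_out, restoration_out, weight_logit=4.0)
```

With a logit of zero the two decoders contribute equally; a larger logit shifts the output towards the first decoder.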